Introduction - SAE and R
November 2025
Let’s take a look at Malawi
Why Malawi?
Consider the 2019/2020 Integrated Household Survey (IHS5)
The goal for today is to give you a brief introduction to R and R Markdown
We will be using two small datasets to get you familiar with the program
A note: if you are completely new to R, the first few weeks will be a slog
Much of the material covered today comes from two (free!) sources:
# lasso --------------------------------------------
# we have ~60 features. This isn't that many, actually. We didn't create a lot of different possible combinations of the predictors.
# We also don't have any fixed effects. This is just to fix ideas. Nonetheless, let's try lasso!
# we use the glmnet package to implement lasso. It also allows ridge, but we want to make sure to use lasso.
# how do we do this? we want to allocate grid cells across different "folds".You can run a line of code by clicking the “Run” button
You can run multiple lines of code by highlighting them and clicking the “Run” button (or the shortcut)
We will practice these later
tidyverse package (more below)c() function:
<-. You can also use =.1class() function
class(vec)is.vector(vec)
The working directory should be where you opened the file from. Check it like this:
The one exception to always using a script? I install packages in the CONSOLE. You can install packages like this:
We need to load any R packages we want to use at the very top of the script. You should have a comment on line one, so on line two write:
This will load the tidyverse package.
read_csv() function to load the data
<-). You can actually use = though.The data frame should show up in the upper right hand corner of RStudio.
Click on the arrow and it will show more information.
[1] "res_id" "ability" "age"
[4] "educyears" "isfarmer" "yearsfarming"
[7] "yearsmanagingfarm" "outsidewage" "worriedaboutcropprices"
[10] "worriedaboutcropyields"
Rows: 7,209
Columns: 10
$ res_id <dbl> 501, 502, 503, 504, 505, 506, 507, 508, 509, 51…
$ ability <dbl> 74, 42, 67, 54, 57, 72, 51, 65, 54, 24, 24, 49,…
$ age <dbl> 83, 27, 49, 50, 70, 45, 58, 41, 45, 70, 24, 45,…
$ educyears <chr> "16", "7", "7", "7", "4", "7", "7", "7", "7", N…
$ isfarmer <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ yearsfarming <dbl> 60, 17, 20, 15, 40, 15, 30, 20, 20, 60, 12, 30,…
$ yearsmanagingfarm <dbl> 46, 17, 5, 10, 26, 15, 25, 15, 10, 50, 7, 25, 5…
$ outsidewage <dbl> 3.000e+06, 1.000e+10, 6.000e+06, 1.000e+10, 1.0…
$ worriedaboutcropprices <chr> "Sometimes", "Not at all", "Sometimes", "Not at…
$ worriedaboutcropyields <chr> "Sometimes", "Sometimes", "Not at all", "Someti…
spc_tbl_ [7,209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ res_id : num [1:7209] 501 502 503 504 505 506 507 508 509 510 ...
$ ability : num [1:7209] 74 42 67 54 57 72 51 65 54 24 ...
$ age : num [1:7209] 83 27 49 50 70 45 58 41 45 70 ...
$ educyears : chr [1:7209] "16" "7" "7" "7" ...
$ isfarmer : chr [1:7209] "Yes" "Yes" "Yes" "Yes" ...
$ yearsfarming : num [1:7209] 60 17 20 15 40 15 30 20 20 60 ...
$ yearsmanagingfarm : num [1:7209] 46 17 5 10 26 15 25 15 10 50 ...
$ outsidewage : num [1:7209] 3e+06 1e+10 6e+06 1e+10 1e+10 ...
$ worriedaboutcropprices: chr [1:7209] "Sometimes" "Not at all" "Sometimes" "Not at all" ...
$ worriedaboutcropyields: chr [1:7209] "Sometimes" "Sometimes" "Not at all" "Sometimes" ...
- attr(*, "spec")=
.. cols(
.. res_id = col_double(),
.. ability = col_double(),
.. age = col_double(),
.. educyears = col_character(),
.. isfarmer = col_character(),
.. yearsfarming = col_double(),
.. yearsmanagingfarm = col_double(),
.. outsidewage = col_double(),
.. worriedaboutcropprices = col_character(),
.. worriedaboutcropyields = col_character()
.. )
- attr(*, "problems")=<externalptr>
$ operatordata$age res_id ability age educyears
Min. : 501 Min. : 10.00 Min. :18.00 Length:7209
1st Qu.:2783 1st Qu.: 51.00 1st Qu.:34.00 Class :character
Median :4714 Median : 59.00 Median :42.00 Mode :character
Mean :4775 Mean : 58.66 Mean :43.54
3rd Qu.:6764 3rd Qu.: 67.00 3rd Qu.:52.00
Max. :8955 Max. :100.00 Max. :87.00
isfarmer yearsfarming yearsmanagingfarm outsidewage
Length:7209 Min. :-9.00 Min. :-9.00 Min. :2.000e+03
Class :character 1st Qu.:18.00 1st Qu.: 6.00 1st Qu.:3.500e+06
Mode :character Median :26.00 Median :14.00 Median :1.000e+10
Mean :28.02 Mean :15.94 Mean :7.156e+09
3rd Qu.:38.00 3rd Qu.:22.00 3rd Qu.:1.000e+10
Max. :70.00 Max. :70.00 Max. :1.000e+10
NA's :219 NA's :219 NA's :216
worriedaboutcropprices worriedaboutcropyields
Length:7209 Length:7209
Class :character Class :character
Mode :character Mode :character
data[1,2]
data[,2]# A tibble: 10 × 9
ability age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 74 83 16 Yes 60 46 3000000
2 42 27 7 Yes 17 17 9999999999
3 67 49 7 Yes 20 5 6000000
4 54 50 7 Yes 15 10 9999999999
5 57 70 4 Yes 40 26 9999999999
6 72 45 7 Yes 15 15 800000
7 51 58 7 Yes 30 25 2000000
8 65 41 7 Yes 20 15 9999999999
9 54 45 7 Yes 20 10 300000
10 24 70 <NA> Yes 60 50 9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
# worriedaboutcropyields <chr>
# A tibble: 10 × 9
ability age educyears isfarmer yearsfarming yearsmanagingfarm outsidewage
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 74 83 16 Yes 60 46 3000000
2 42 27 7 Yes 17 17 9999999999
3 67 49 7 Yes 20 5 6000000
4 54 50 7 Yes 15 10 9999999999
5 57 70 4 Yes 40 26 9999999999
6 72 45 7 Yes 15 15 800000
7 51 58 7 Yes 30 25 2000000
8 65 41 7 Yes 20 15 9999999999
9 54 45 7 Yes 20 10 300000
10 24 70 <NA> Yes 60 50 9999999999
# ℹ 2 more variables: worriedaboutcropprices <chr>,
# worriedaboutcropyields <chr>
dbl is short for double, which is a numeric variable (the “type” of numeric variable is about how much memory is needed to store it)chr is short for character, which is a string of characters (text)
educyears was a character string even though it seemed to be a numbereducyears using the unique() function, which outputs a vector:==) to check whether the value is “Not Mentioned”
[1] "16" "7" "4" NA "11" "6" "13" "5" "8" "10" "12" "9" "2" "3" "15"
[16] "14" "20" "18" "17" "1" "19"
[1] "numeric"
|>)
Here is an example of how we can use pipes with the mutate() function in tidyverse
ifelse() to make this work Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 7.000 7.000 6.735 7.000 20.000 3113
Note that we could wrap as.numeric() around the ifelse() command to do it on one line!
In Stata, by default, functions ignore missing values
[1] NA
If there are any missing values, the function will evalute to missing!
The mean() function in the previous slide outputs a single value - That means we could store that value as an object:
[1] 6.735107
[1] 2.404086
How is this helpful? We can use these values later in our script!
We can combine the mean() and sd() functions within mutate to create a new, standardized variable:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 7209
Oh no! what happened?
We can combine the mean() and sd() functions within mutate to create a new, standardized variable:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.3856 0.1102 0.1102 0.0000 0.1102 5.5176 3113
Note that we can shorten TRUE to T (or FALSE to F).
First install a new package that has a dataset we will use (you can do this in the console):
Now let’s see:
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
There’s a nice package called lubridate that makes working with dates much easier.
Departure time/arrival time is in the format HHMM (e.g., 1530 is 3:30pm). We can add this to the date
[1] "5H 17M 0S" "5H 33M 0S" "5H 42M 0S" "5H 44M 0S" "5H 54M 0S" "5H 54M 0S"
[7] "5H 55M 0S" "5H 57M 0S" "5H 57M 0S" "5H 58M 0S" "5H 58M 0S" "5H 58M 0S"
[13] "5H 58M 0S" "5H 58M 0S" "5H 59M 0S" "5H 59M 0S" "5H 59M 0S" "6H 0M 0S"
[19] "6H 0M 0S" "6H 1M 0S"
Lubridate also lets us work with “periods”
Let’s get the average departure delay by NYC airport:
# A tibble: 3 × 2
origin avg_dep_delay
<chr> <dbl>
1 EWR 15.1
2 JFK 12.1
3 LGA 10.3
Note that this does not create a single value. Instead it creates a tibble (a data frame) summarizing the data by our grouping variable.
What if we want to save that tibble instead?
# A tibble: 3 × 2
origin avg_dep_delay
<chr> <dbl>
1 EWR 15.1
2 JFK 12.1
3 LGA 10.3
I could then output this to a table if I wanted to (using Markdown, more on this later):
| origin | avg_dep_delay |
|---|---|
| EWR | 15.10795 |
| JFK | 12.11216 |
| LGA | 10.34688 |
How does departure delay vary by time of day?
We can color code by origin, too!
hh_b04 (age)hh_b11 (main occupation) [1] "AGRICULTURE / LIVESTOCK"
[2] "PRIVATE SECTOR"
[3] "student"
[4] "EMPLOYED (NOT AG): WITHOUT EMPLOYEES"
[5] "UNPAID FAMILY WORK"
[6] "NO JOB"
[7] "TOO YOUNG"
[8] NA
[9] "EMPLOYED (NOT AG): WITH EMPLOYEES"
[10] "tourism"
[11] "JOB SEEKERS"
[12] "disabled"
[13] "PAID FAMILY WORK"
[14] "goverment"
[15] "NGO/RELIGIOUS"
[16] "parastatal"
[17] "mining"
[18] "fishing"
- The relevant two values are “AGRICULTURE / LIVESTOCK” and “student”
cut function
df <- df |>
1 mutate(age_groups = cut(hh_b04, breaks = seq(from = 0, to = max(df$hh_b04), by = 5)))
head(df$age_groups)age_groups based on the conditions given (line 3)
seq function creates a sequence of numbers from 0 to the maximum age in the dataset, in increments of 5age_groupsage_groups
ag and ed (removing missing values)
aes().ggplot() +
geom_point(data = dfgroups,
aes(x = age_groups, y = ag, color = "Ag")) +
geom_point(data = dfgroups,
aes(x = age_groups, y = ed, color = "Educ.")) +
labs(x = "Age group",
y = "Proportion of respondents in category") +
scale_color_manual("Occupation:",
values = c("Ag" = colors[1], "Educ." = colors[2])) +
theme_bw()ggplot() +
geom_point(data = dfgroups,
aes(x = age_groups, y = ag, color = "Ag")) +
geom_point(data = dfgroups,
aes(x = age_groups, y = ed, color = "Educ.")) +
labs(x = "Age group",
y = "Proportion of respondents in category") +
scale_color_manual("Occupation:",
values = c("Ag" = colors[1], "Educ." = colors[2])) +
theme_bw()ggplot() +
geom_point(data = dfgroups,
aes(x = age_groups, y = ag, color = "Ag")) +
geom_point(data = dfgroups,
aes(x = age_groups, y = ed, color = "Educ.")) +
labs(x = "Age group",
y = "Proportion of respondents in category") +
scale_color_manual("Occupation:",
values = c("Ag" = colors[1], "Educ." = colors[2])) +
theme_bw() +
theme(legend.position = c(0.9, 0.8))ggplot() +
1 geom_line(data = dfgroups, aes(x = as.numeric(age_groups), y = ag, color = "Ag")) +
geom_line(data = dfgroups, aes(x = as.numeric(age_groups), y = ed, color = "Educ.")) +
2 scale_x_continuous(breaks = as.numeric(unique(dfgroups$age_groups)), labels = unique(dfgroups$age_groups)) +
labs(x = "Age group", y = "Proportion of respondents in category") +
scale_color_manual("Occupation:", values = c("Ag" = colors[1], "Educ." = colors[2])) +
theme_bw() +
theme(legend.position = c(0.9, 0.8))ggplot() +
geom_point(data = dfgroups, aes(x = as.numeric(age_groups), y = ag, color = "Ag")) +
geom_line(data = dfgroups, aes(x = as.numeric(age_groups), y = ag, color = "Ag")) +
geom_point(data = dfgroups, aes(x = as.numeric(age_groups), y = ed, color = "Educ.")) +
geom_line(data = dfgroups, aes(x = as.numeric(age_groups), y = ed, color = "Educ.")) +
scale_x_continuous(breaks = as.numeric(unique(dfgroups$age_groups)), labels = unique(dfgroups$age_groups)) +
labs(x = "Age group", y = "Proportion of respondents in category") +
scale_color_manual("Occupation:", values = c("Ag" = colors[1], "Educ." = colors[2])) +
theme_bw() +
theme(legend.position = c(0.9, 0.8))